SYSTEM AND METHOD FOR DETECTING TEXT SIMILARITY
OVER SHORT PASSAGES



FIELD OF THE INVENTION 
The present invention relates generally to natural language processing and
more particularly relates to a system and method for determining the similarity of text
in short passages.



BACKGROUND OF THE INVENTION 
With the growing volume of textual information, such as newspaper articles,
magazines, Internet articles, and the like, there is a growing need to automatically
cluster and/or classify such documents and determine whether groups of documents
express similarities or not. For the most part, research in this area has focused on
detecting similarity between documents and large segments of text or between a short
query phrase and one or more documents.

While effective techniques have been developed for document clustering and
classification which depend on inter-document similarity measures, these techniques
generally rely only on shared words, or occasionally on collocation of words. Such
techniques are applicable when large units of text, such as full documents, are
compared. In this case, there is generally sufficient overlap to detect similarity in the
documents and/or document segments. However, when the units of text are small, for
example a paragraph or abstract, such simple surface matching of words and phrases
is far more prone to error. In the case of small text units, the sample size is reduced
and the number of potential matches is reduced accordingly. Thus, there remains a
need for improved techniques for detecting similarities between small text units.

A further problem with known techniques for detecting similarity is that the
conventional notions of similarity which are applicable to large text samples, such as
documents and large text segments, do not provide sufficient measures of similarity
for small text segments. Standard notions of similarity generally involve creating a
vector or profile of characteristics of a text fragment and determining a conceptual
distance between vectors on the basis of frequencies. Features typically include
stemmed words, although multi-word units and collocations also have been used.
Typological characteristics, such as thesaural features, have also been used to
calculate features. The difference between vectors for one text unit (usually a query)
and another text unit (usually a document) then determines closeness or similarity of
the text units.

In some cases, the text units are represented as vectors of sparse n-grams of
word occurrences and learning is applied over those vectors. Though effective in the
context of large document comparisons, a more fine-grained distinction for similarity
measures is required to properly characterize the similarity of two small text
segments.

SUMMARY OF THE INVENTION 
It is an object of the present invention to provide systems and methods for 
detecting similarity between two or more small text segments. 

A method for determining similarity in short text segments in accordance with
the present invention includes the steps of determining common primitive features in
the text segments, determining common composite features in the text segments and
then calculating a similarity measure based upon the primitive and composite
features. The primitive features can be selected from the group including common
single words, common noun phrases, synonyms, common semantic classes of verbs,
and common proper nouns. The composite features, which represent relationships
between and among the primitive features, can be selected from the group including
primitive feature order restrictions, primitive feature distance restrictions, and
primitive type restrictions.

Preferably, the step of determining common primitive features can include the
further steps of identifying common primitive features, assigning a value to the
primitive features, and normalizing the feature values. Normalizing the values can
include normalizing for text segment length and normalizing for the frequency of
primitive feature occurrence. Similarly, determining composite features generally
includes identifying the composite features, assigning a value to the composite
features, and normalizing the feature values. Again, normalization of the feature
values can include normalizing for text segment length and normalizing for the
frequency of feature occurrence.



BRIEF DESCRIPTION OF THE DRAWING 
Further objects, features and advantages of the invention will become apparent
from the following detailed description taken in conjunction with the accompanying
figures showing illustrative embodiments of the invention, in which:

Figure 1 is a flow chart illustrating an overview of a present method for
comparing small text segments;

Figure 2 is a flow chart illustrating the step of defining similarity for small text
segments in accordance with the present methods;

Figure 3 is a flow chart illustrating the process of computing primitive features
for use in detecting similarity in small text segments;

Figure 4 is a flow chart illustrating the process of calculating composite
features for use in detecting similarity of small text segments in accordance with the
present methods;

Figure 5 is a block diagram of a software system topology for determining
similarity in small text segments in accordance with the present methods;

Figure 6 is an illustration of exemplary short text segments;

Figure 7 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "same order" rule;

Figure 8 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "within distance" rule; and

Figure 9 is a diagram illustrating a composite feature match between two of
the short text segments provided in Figure 6 using a "primitive type" rule.

Throughout the figures, the same reference numerals and characters, unless
otherwise stated, are used to denote like features, elements, components or portions of
the illustrated embodiments. Moreover, while the subject invention will now be
described in detail with reference to the figures, it is done so in connection with the
illustrative embodiments. It is intended that changes and modifications can be made
to the described embodiments without departing from the true scope and spirit of the
subject invention as defined by the appended claims.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
Figure 1 is a flow chart illustrating an overview of the process used in the
present invention for detecting similarity in small text segments. As previously noted,
a problem in the prior art is that the definition of similarity commonly used for large
text segments, such as documents, is not sufficiently refined to provide an adequate
measure of similarity when comparing small text segments. Generally, small text
segments refer to sentences, phrases and short paragraphs.

Referring to Figure 1, in step 100 a definition of similarity for small text
segments is provided. From this definition, the method proceeds to identify primitive
features of the small text segments and determine feature values for the primitive
features (step 105). Primitive features are those which generally compare simple parts
of speech and text, such as single words, word categories, or phrases such as noun
phrases, synonyms, verb class and proper nouns. In addition to primitive features, the
process can identify composite features of the short text segments and determine
composite feature values (step 110). Composite features are those which compare
relationships among two or more primitive features. Once primitive features and
composite features have been identified and given an appropriate value, a machine
learning algorithm is applied to classify small text segments as similar or not similar
(step 115).

Figure 2 is a flow chart which illustrates the process of establishing an
appropriate definition of similarity for small text segments. In general, two text units
can be considered as similar if they share the same focus on a common concept, actor,
object or action. In addition, the common actor or object must perform or
be subjected to the same action or be the subject of the same description. This is
exemplified in the flow chart of Figure 2, where two small text segments are selected
from a body of text and are analyzed. If the two text segments relate to a common
concept (step 205), then further analysis is performed to see if the common concept
relates to the same action (step 210) or relates to the same description (step 215).
Similar tests are performed to determine if the two text segments relate to a common
actor (step 220) or to a common object (step 225). If there is no common concept,
actor or object, the text segments are considered not similar (step 235). Similarly,
those text segments which do refer or relate to a common concept, actor or object
will still be found not similar unless they also relate to a common
action or involve the same description. Thus, for short text segments to be similar,
they must contain a common concept, actor, or object which is also the subject of a
common action or description. The comparisons in steps 205, 220 and 225 can be the
basis for primitive features 240. Those relationships between primitive features which
are identified in steps 210, 215 can be referred to as composite features 245.

While Figure 2 is illustrated as a sequential process, it represents a decision
tree underlying the definition of similarity of two short text segments as applied in the
present invention, and this decision tree can also be evaluated in a largely parallel
manner. For example, decisions 205, 220 and 225 can be performed concurrently, as can
decisions 210 and 215. Using this definition of similarity for small text segments, a
feature-based process can be employed which compares primitive and composite features of
short text segments to determine if the definition is satisfied for two or more given
input text segments.
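
By way of illustration only, the decision logic of Figure 2 might be sketched in
Python as follows. The class FocusEvidence, its field names and the function
is_similar are hypothetical names introduced for this sketch; how each flag would
actually be computed is left to the feature extraction described below.

```python
from dataclasses import dataclass


@dataclass
class FocusEvidence:
    """Hypothetical flags for one pair of text segments (illustrative names)."""
    common_concept: bool = False    # outcome of step 205
    common_actor: bool = False      # outcome of step 220
    common_object: bool = False     # outcome of step 225
    same_action: bool = False       # outcome of step 210
    same_description: bool = False  # outcome of step 215


def is_similar(evidence: FocusEvidence) -> bool:
    """Similar only if a shared focus also shares an action or a description."""
    shared_focus = (evidence.common_concept
                    or evidence.common_actor
                    or evidence.common_object)
    shared_predicate = evidence.same_action or evidence.same_description
    return shared_focus and shared_predicate
```

Because the five flags are independent inputs, they can be computed in any order
or concurrently, which is consistent with the parallel evaluation noted above.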

Figure 3 is a flow chart which illustrates a method for extracting and scaling
primitive features in accordance with the present invention. The text segments are
compared for a level of commonality, including determining whether there is a
common single word (step 305), a common noun phrase (step 310), whether two
words in the phrases are synonyms (step 315), whether the phrases include verbs
having a common semantic class (step 320), and whether a common proper noun can
be found in the two phrases (step 325). If none of these conditions is satisfied for the
applied small text segments, there is no primitive feature common to these two text
segments (step 327). When a primitive feature has been identified, i.e., one of the
conditions in steps 305 through 325 is satisfied, a feature value is assigned to that
primitive feature. Preferably, the values which are assigned to the features are
determined by a machine learning algorithm, such as RIPPER, which is trained using
a suitable training corpus. RIPPER is a widely-used and effective rule induction
system which is available from AT&T Laboratories and is described by Cohen in
"Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth
National Conference on Artificial Intelligence, American Association for Artificial
Intelligence, 1996, which is incorporated by reference. It has been found that a
subset of a corpus of 264 paragraphs which have been manually tagged by human readers
as similar or not similar can be used to establish a feature rule set for RIPPER which
is then suitable for assigning values to the features identified in the text segments.
The particular training corpus and learned rule set will generally vary depending on
the desired application. The values assigned will vary based on properties of the
machine learning algorithm and training corpus. After feature values are assigned in
step 330, these values can be normalized based on text length (step 335) and/or noted
frequency of occurrence (step 340). Though normalization is optional, it is a desirable
step to provide uniform and accurate results across varying types of text and length of
text segments.

Primitive features provide a baseline indication of similarity. To further refine
the notion of similarity in small text segments, relationships among primitive features,
referred to as composite features, can also be evaluated. Referring to Figure 4, a
method of evaluating composite features is illustrated. Composite features are those
features which identify relationships among primitive feature pairs. Generally,
composite features are defined by placing different forms of restrictions on
participating primitive feature pairs. Referring to Figure 4, the primitive features
identified in each of the small text segments are applied to a test layer 400 where
various feature relationships are evaluated. The relationships illustrated in test layer
400 are exemplary in nature and are not intended to illustrate an exhaustive list of
possible relationships. It will be appreciated that a large number of relationships
between and among primitive features can be used to establish composite features.

For example, one type of feature relationship for composite features can be
that the primitives occur in the same order in each of the text samples (step 405). This
is illustrated by example in Figure 7. Figure 6 provides three short text segments to
be compared. Figure 7 illustrates a match according to the "same order" composite
feature rule. In Figures 7-9, primitive features are identified by shading and the
relationships which form the composite features are illustrated by connecting lines. In
the case illustrated in Figure 7, the primitive features {two, contact} appear in the same
order in text segments (a) and (b) of Figure 6.
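
As a non-limiting sketch, the "same order" test of step 405 could be expressed in
Python as shown below. The function name same_order is hypothetical, the segments
are assumed to be already tokenized into lists of words, and only the first
occurrence of each primitive is considered.

```python
def same_order(tokens_a, tokens_b, first, second):
    """Return True if `first` and `second` occur in both token lists
    and appear in the same relative order in each list."""
    orders = []
    for tokens in (tokens_a, tokens_b):
        if first not in tokens or second not in tokens:
            return False  # the pair of primitives must occur in both segments
        # Compare positions of the first occurrence of each primitive.
        orders.append(tokens.index(first) < tokens.index(second))
    return orders[0] == orders[1]
```

For instance, with two invented token lists in which "two" precedes "contact" in
both, the function returns True; if the order is reversed in one list, it returns
False.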

Another possible relationship is that two pairs of primitive elements are
required to occur within a certain distance in both text segments. The maximum
distance between the primitive elements which would satisfy the relationship can be a
variable or a predetermined constant (step 410). Referring to Figure 8, an example of a
positive match for the "within distance" composite feature rule is provided, given that
the distance, n, is set to a value less than three. In Figure 8, although the primitive
features {contact, lost} do not appear in the same order, they occur within n words of
each other (n<3 in this case).
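
A minimal sketch of the "within distance" test of step 410 is given below, again in
Python. It assumes that distance is counted as the difference between token
positions and that the closest pair of occurrences is used; the function name
within_distance and the parameter max_distance are hypothetical.

```python
def within_distance(tokens_a, tokens_b, first, second, max_distance):
    """Return True if, in both token lists, some occurrence of `first`
    lies within `max_distance` token positions of some occurrence of `second`."""
    for tokens in (tokens_a, tokens_b):
        pos_first = [i for i, tok in enumerate(tokens) if tok == first]
        pos_second = [i for i, tok in enumerate(tokens) if tok == second]
        if not pos_first or not pos_second:
            return False  # both primitives must occur in each segment
        closest = min(abs(i - j) for i in pos_first for j in pos_second)
        if closest > max_distance:
            return False
    return True
```

Unlike the "same order" test, this test ignores which primitive comes first, so a
pair such as {contact, lost} can match even when its order differs between the two
segments.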

Yet another exemplary relationship can be that the two text segments include
the same primitive feature types. For example, one primitive feature can be restricted
to a simplex noun phrase while the other to a verb. In such a case, two noun phrases,
one from each text unit, must match according to the rule for matching simplex noun
phrases and two verbs must match according to the applied rules of verb primitives
(e.g., sharing the same semantic class). This is illustrated in Figure 9 where the
primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with
"the helicopter" and both phrases include a common verb, "lost".

By matching primitive feature types, a simple grammatical relationship is
determined in the text segments. Returning to Figure 4, for each condition that is
satisfied in test layer 400, feature values are assigned to those composite features
identified (step 420). The feature values are assigned by a machine learning
algorithm, such as RIPPER, which has been trained on a suitable training corpus. As
in the case of primitive features, optionally, the feature values assigned to the
composite feature can be normalized for text length and relative occurrence of the
primitive feature or composite feature (steps 425, 430, respectively). Once both
primitive features and composite features of the small text segments have been
identified, a machine learning algorithm is applied to determine a similarity value
between the text segments (step 435). The machine learning algorithm can perform a
rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be
used to determine similarity by comparing the total feature value of the text segments
being compared to a predetermined threshold value.

Figure 5 is a block diagram of an exemplary software system for conducting
the method described in connection with Figures 1-4. The system is generally
implemented in software for a general purpose computer, such as a personal computer
or workstation. The system includes a main processing section 500. One or more
interface modules 510 are included for receiving text input for the text segments to
be compared and for providing the text segments to the main processing section 500.
The text input can be provided by a number of sources, including but not limited to,
computer readable memory, hard disks, optical disks, network databases, on-line
sources, manual keyed input and the like. Based on the desired text source and input
mechanism, one skilled in the art can provide appropriate text input interface module
510 hardware and software.

The main processing section 500 is also operatively coupled to a training
corpus 515, which is generally stored in computer readable storage media. The main
processing section 500 is generally programmed in a structured manner which calls
various subprograms, library routines, and the like to perform the various functions
described in accordance with Figures 1-4. The main processing section 500 can
invoke the various subroutines sequentially (serial) or in a parallel, or batched,
processing mode. The received text is generally passed to a preprocessing routine
520. The preprocessing routine cleans up the received text, such as by removing
control characters from the text. The preprocessing routine also performs
part-of-speech (POS) tagging, using known techniques, such as are available in the
ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the
Alembic System as used for MUC-6," Proceedings of the Sixth Message
Understanding Conference, 1995, which is hereby incorporated by reference.
ALEMBIC provides a set of data and language processing tools which identify the
various parts of speech present in the small text segments.

Following text preprocessing, control is returned to the main processing
section 500 which then preferably invokes a noun phrase comparison subroutine 525,
such as LinkIt, to perform noun phrase comparison of step 310. LinkIt can be
employed to determine whether a common noun phrase is present in the applied text
segments and for identifying simplex noun phrases and matching those that share the
same noun head. The LinkIt tool is described by N. Wacholder in "Simplex NPs
Clustered by Head: A Method for Identifying Significant Topics in a Document",
Proceedings of the Workshop on the Computational Treatment of Nominals, October
1998, which is hereby incorporated by reference in its entirety.
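
The idea of matching simplex noun phrases that share the same head can be sketched
as follows. This is not the LinkIt tool; it is a simplified Python stand-in which
assumes that the noun phrases have already been identified and that the head of an
English simplex noun phrase is its last word. The function names np_head and
np_match are hypothetical.

```python
def np_head(noun_phrase):
    """Assumed heuristic: the head of a simplex English noun phrase
    is its final whitespace-separated token, compared case-insensitively."""
    tokens = noun_phrase.lower().split()
    return tokens[-1] if tokens else ""


def np_match(np_a, np_b):
    """Two noun phrases match if both have a head and the heads are identical."""
    head_a, head_b = np_head(np_a), np_head(np_b)
    return head_a != "" and head_a == head_b
```

Under this heuristic, "An OH-58 helicopter" and "the helicopter" share the head
"helicopter" and are therefore treated as a match.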

To determine if two segments include common proper nouns as required in
step 325, the noun comparison algorithm can also be used to match those nouns
identified using the ALEMBIC toolset using various predetermined matching criteria.
Variations on proper noun matching can include restricting the proper noun type to a
person, place or organization. Such subcategories can also be extracted using
ALEMBIC's named entity finder.

Following noun phrase identification and matching, other routines for
detecting primitive features can be employed. For example, to perform step 305 and
determine whether common single word primitive features exist between two text
segments, a word co-occurrence detection sub-routine 540 can be called by the main
program 500. Variations of the word co-occurrence operation can restrict matching to
cases where the parts of speech of the words also match, or relax the comparison to
cases where only the word stems of the two words are identical.
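
The word co-occurrence test and its two variations can be sketched in Python as
follows. The sketch assumes that each segment is supplied as a list of (word,
part-of-speech tag) pairs; the function names and the toy suffix-stripping stemmer
are assumptions made for illustration and are not a real stemming algorithm.

```python
def crude_stem(word):
    """Toy stemmer (assumption): strips a few common English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def common_words(tagged_a, tagged_b, match_pos=False, use_stems=False):
    """Return the set of word keys shared by two lists of (word, pos) pairs.

    match_pos=True restricts matches to words whose POS tags also agree;
    use_stems=True relaxes matching to identical stems.
    """
    def key(word, pos):
        base = crude_stem(word) if use_stems else word.lower()
        return (base, pos) if match_pos else base

    keys_a = {key(w, p) for w, p in tagged_a}
    keys_b = {key(w, p) for w, p in tagged_b}
    return keys_a & keys_b
```

A non-empty result indicates that at least one common single word primitive is
present under the chosen variation.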
Similarly, to determine if two text segments include words which are
synonyms, a synonym detection algorithm 530 can be called by the main processing
routine 500. In this regard, a lexical database such as WordNet®, as described by G.
Miller in "WordNet, An On-Line Lexical Database," International Journal of
Lexicography, Vol. 3, No. 4 (1990), can be employed. WordNet provides sense
information and places words in sets of synonyms (synsets). Words that appear in the
same synset are generally considered matches. Variations on this feature can be used
to restrict the words being compared to a specific part-of-speech class.
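
The synset test can be illustrated with the short Python sketch below. It does not
query WordNet; instead it uses a small hand-written table, TOY_SYNSETS, which is an
assumption standing in for the lexical database. Identical words are deliberately
excluded here so that they are counted only by the word co-occurrence feature,
which is a design choice of this sketch.

```python
# Hypothetical stand-in for a lexical database: each set is one synset.
TOY_SYNSETS = [
    {"helicopter", "chopper", "whirlybird"},
    {"crash", "collision"},
]


def are_synonyms(word_a, word_b, synsets=TOY_SYNSETS):
    """Return True if two distinct words appear together in some synset."""
    a, b = word_a.lower(), word_b.lower()
    if a == b:
        return False  # identical words are handled by word co-occurrence
    return any(a in synset and b in synset for synset in synsets)
```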

To determine if two verbs present in the short text segments are of the same
semantic class as set forth in step 320, a verb classifier and comparator algorithm 535
can be operatively coupled to the main processing section 500 and called by the main
program. Semantic classes for verbs have been found to be useful for determining
document types and text similarity. This is discussed, for example, in "The Role of
Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual
Meeting of the Association for Computational Linguistics and the 17th International
Conference on Computational Linguistics, 1998, which is hereby incorporated by
reference in its entirety. For those verbs which are found to have a common semantic
class, e.g., communication, motion, agreement, argument, etc., those verbs are
considered to match.
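
A verb class comparison of this kind might be sketched in Python as shown below.
The table TOY_VERB_CLASSES is a hypothetical hand-made mapping that reuses the
class names mentioned above; it is not the classification used by the verb
classifier 535.

```python
# Hypothetical verb-to-class table used only for illustration.
TOY_VERB_CLASSES = {
    "say": "communication",
    "report": "communication",
    "fly": "motion",
    "move": "motion",
    "agree": "agreement",
    "dispute": "argument",
}


def same_verb_class(verb_a, verb_b, classes=TOY_VERB_CLASSES):
    """Return True if both verbs are known and belong to the same class."""
    class_a = classes.get(verb_a.lower())
    class_b = classes.get(verb_b.lower())
    return class_a is not None and class_a == class_b
```

Verbs missing from the table are treated as non-matching rather than as members of
a shared "unknown" class.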

The program operating in main processing section 500 can also provide
algorithms to normalize feature values for text length and relative occurrence of the
primitive. To normalize feature values for text length, as set forth in step 335, each
feature value can be normalized by the size of the textual segments in the pair. For
example, for a pair of textual segments A and B, the feature values assigned are
divided by a normalization value, N:

$$N = \sqrt{\mathrm{length}(A) \times \mathrm{length}(B)} \qquad (1)$$

This operation removes any potential bias in favor of longer text segments. It is noted
that the lengths of A and B are generally measured by a word count.

Normalization of feature values can also be based on the relative frequency of
occurrence of each primitive feature. Such normalization is motivated by the general
observation that infrequently matching primitive elements are likely to have a higher
impact on similarity than primitives which match more frequently. Such
normalization is similar to the document frequency component of the commonly
employed TF*IDF calculation. In this case, each primitive feature is associated with a
value which is equal to the number of textual units in which the primitive appeared in
the corpus. For a primitive element which compares single words, this is the number
of text segments which contain that word in the corpus; for a noun phrase, this is the
number of textual units that contain noun phrases that share the same head; and
similarly for other primitive types. Each feature's value is multiplied by:

$$\log\left(\frac{T}{N}\right) \qquad (2)$$

where T is the total number of textual segments and N is the number of textual segments
containing the primitive. It is noted that since normalization for text length and
frequency of occurrence are both optional operations, when these two normalization
techniques are selectively applied, there are up to four variations of normalizations for
each primitive feature. Of course, other normalization techniques may be added to, or
substituted for, the two methods discussed herein.
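
The two optional normalizations can be sketched in Python as follows. The sketch
assumes that the length normalization value is the square root of the product of
the two word counts and that the frequency weight is the natural logarithm of T
divided by N; the base of the logarithm and all function and parameter names are
assumptions made for illustration.

```python
import math


def length_norm(len_a, len_b):
    """Assumed length normalizer: geometric mean of the two word counts."""
    return math.sqrt(len_a * len_b)


def frequency_weight(total_segments, segments_with_primitive):
    """Assumed IDF-style weight: log of total segments over matching segments."""
    return math.log(total_segments / segments_with_primitive)


def normalize(value, len_a, len_b, total_segments, segments_with_primitive,
              by_length=True, by_frequency=True):
    """Apply either, both, or neither normalization to one feature value."""
    if by_length:
        value = value / length_norm(len_a, len_b)
    if by_frequency:
        value = value * frequency_weight(total_segments, segments_with_primitive)
    return value
```

The two boolean switches give the four variations mentioned above: no
normalization, length only, frequency only, or both.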

The program in main processing section 500 generally employs a machine
learning algorithm 545 to determine whether the text units match overall. A suitable
machine learning algorithm is RIPPER, as disclosed by Cohen in "Learning Trees and
Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference
on Artificial Intelligence, American Association for Artificial Intelligence, 1996,
which is incorporated by reference. RIPPER is a widely-used and effective rule
induction system. This RIPPER algorithm is trained over a corpus of manually
marked pairs of text units contained in the training corpus 515. A suitable corpus was
constructed using a subset of the Topic Detection and Tracking (TDT) corpus
developed by NIST and DARPA. The TDT corpus is a collection of over 16,000
news articles from Reuters and CNN where many of the articles have been manually
grouped into 25 categories, each of which corresponds to a single event. The selected
corpus was formed using the Reuters articles in five of the twenty-five categories
from randomly selected days. The resulting training corpus 515 contained 30 related
articles. The 30 articles provided 264 paragraphs which were selected as the small
text segments and resulted in 10,345 comparisons between segments.

Although use of a machine learning algorithm is preferred, other algorithms
can also be used. For example, an algorithm can add the total value of composite
features found in the text segments and compare this value against a similarity
threshold. Similarly, although it is preferred to determine feature values based on the
use of a machine learning algorithm, feature values can be predetermined based on
human experience through the use of a look-up table. Alternatively, all features can
be given a binary value and the similarity comparison can be determined based on a
simple accumulated count of detected primitive and composite features.
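
These simpler alternatives can be sketched in Python as shown below. The look-up
table EXAMPLE_TABLE, its feature names and its weights are invented for
illustration only, as is the convention that a score equal to the threshold counts
as similar.

```python
# Hypothetical look-up table of predetermined feature values.
EXAMPLE_TABLE = {
    "common_word": 1.0,
    "common_noun_phrase": 2.0,
    "synonym": 1.5,
    "same_order": 2.5,
}


def score(detected_features, table=None):
    """Sum look-up values, or count features (binary values) if no table."""
    if table is None:
        return float(len(detected_features))
    return sum(table.get(name, 0.0) for name in detected_features)


def is_similar(detected_features, threshold, table=None):
    """Compare the accumulated score against a predetermined threshold."""
    return score(detected_features, table) >= threshold
```

With the features "common_word" and "same_order" detected, the binary count is 2.0
and the table-based score is 3.5, so a threshold of 3.0 is met only in the
table-based case.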
The present methods, while evaluated on a corpus of English language
documents, are not language specific and are generally applicable to any language. Of
course, the individual subroutines may require some alteration to accommodate the
varied constructions found in different languages.

The methods for determining similarity in small text segments described
herein form an important component in larger systems, such as document archiving
systems and multi-document summarization systems.

Although the present invention has been described in connection with specific
exemplary embodiments, it should be understood that various changes, substitutions
and alterations can be made to the disclosed embodiments without departing from the
spirit and scope of the invention as set forth in the appended claims.
CLAIMS

1. A method for determining similarity in short text segments comprising:
determining common primitive features in the text segments;
determining common composite features in the text segments; and
calculating a similarity measure based upon said primitive and
composite features.

2. The method for determining similarity as defined by claim 1, wherein
said primitive features are selected from the group including common single word,
common noun phrase, synonyms, common semantic class of verbs, and common
proper nouns.

3. The method for determining similarity as defined by claim 1, wherein
said composite features are selected from the group including primitive feature order
restrictions, primitive distance restrictions, and primitive type restrictions.

4. The method for determining similarity as defined by claim 1, wherein
said step of determining common primitive features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.

5. The method for determining similarity as defined by claim 4, wherein 
said step of normalizing includes at least one of normalizing for text segment length 
and normalizing for frequency of primitive occurrence. 



6. The method for determining similarity as defined by claim 1, wherein
said step of determining common composite features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.



7. The method for determining similarity as defined by claim 6, wherein
said step of normalizing includes at least one of normalizing for text segment length
and normalizing for frequency of primitive occurrence.

8. A system for determining similarity in short text segments comprising:
an interface circuit for receiving text segments for comparison;
a main processing section, the main processing section being
operatively coupled to the interface circuit and operating under the control of a
computer program, the program performing operations to determine common
primitive features in the text segments, determine common composite features in the
text segments, calculate a similarity measure based upon said primitive and
composite features, and provide an output indicative of the similarity measure.

9. The system for determining similarity as defined by claim 8, wherein
said primitive features are selected from the group including common single word,
common noun phrase, synonyms, common semantic class of verbs, and common
proper nouns.

10. The system for determining similarity as defined by claim 8, wherein
said composite features are selected from the group including primitive feature order
restrictions, primitive distance restrictions, and primitive type restrictions.

11. The system for determining similarity as defined by claim 8, wherein
the processing operation of determining common primitive features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.
12. The system for determining similarity as defined by claim 11, wherein
the processing operation of normalizing includes at least one of normalizing for text
segment length and normalizing for frequency of primitive occurrence.



13. The system for determining similarity as defined by claim 8, wherein
said processing operation for determining common composite features includes:

identifying common primitive features;
assigning a value to said primitive features; and
normalizing said value.



14. The system for determining similarity as defined by claim 13, wherein
said processing operation for normalizing includes at least one of normalizing for text
segment length and normalizing for frequency of primitive occurrence.



15. The system for determining similarity as defined by claim 8, wherein
the computer program includes a noun phrase identification subroutine, a synonym
detection subroutine, a verb classifier subroutine and a word co-occurrence
subroutine.



16. The system for determining similarity as defined by claim 8, further
comprising a computer readable training corpus, and wherein the computer program
includes a machine learning algorithm operatively coupled to the training corpus for
learning and applying a rule set for determining similarity in small text segments.